591 TFLOPS Multi-trillion Particles Simulation on SuperMUC
Authors
Abstract
Anticipating large-scale molecular dynamics (MD) simulations in nano-fluidics, we conduct performance and scalability studies of an optimized version of the code ls1 mardyn. We present our implementation, which requires only 32 bytes per molecule and allows us to run what is, to our knowledge, the largest MD simulation to date. We explain our optimizations tailored to the Intel Sandy Bridge processor, including vectorization as well as shared-memory parallelization to make use of Hyperthreading. Finally, we present results for weak and strong scaling experiments on up to 146,016 cores of SuperMUC at the Leibniz Supercomputing Centre, achieving a speed-up of 133k, which corresponds to an absolute performance of 591.2 TFLOPS.
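The headline figures in the abstract can be related to each other with some simple arithmetic. The following is a minimal sketch, not taken from the paper: the function names are hypothetical, the speed-up is assumed to be measured against a single core (the abstract does not state the baseline), and the molecule count passed to `memory_tib` is a placeholder rather than the simulated system size.

```python
# Figures quoted in the abstract.
CORES = 146_016            # cores used on SuperMUC
SPEEDUP = 133_000          # "133k", as rounded in the abstract
TOTAL_TFLOPS = 591.2       # absolute performance
BYTES_PER_MOLECULE = 32    # memory footprint per molecule


def parallel_efficiency(speedup: float, cores: int) -> float:
    """Speed-up divided by core count (assumes a single-core baseline)."""
    return speedup / cores


def gflops_per_core(total_tflops: float, cores: int) -> float:
    """Average sustained performance per core, in GFLOPS."""
    return total_tflops * 1e3 / cores


def memory_tib(n_molecules: int, bytes_per_molecule: int = BYTES_PER_MOLECULE) -> float:
    """Aggregate memory for the molecule data alone, in TiB."""
    return n_molecules * bytes_per_molecule / 2**40


if __name__ == "__main__":
    print(f"parallel efficiency: {parallel_efficiency(SPEEDUP, CORES):.1%}")
    print(f"per-core performance: {gflops_per_core(TOTAL_TFLOPS, CORES):.2f} GFLOPS")
    # Placeholder count of 10**12 molecules, chosen only for illustration.
    print(f"memory for 1e12 molecules: {memory_tib(10**12):.1f} TiB")
```

Under these assumptions the quoted speed-up corresponds to a parallel efficiency of roughly 91%, and the absolute performance to about 4 GFLOPS per core.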
Similar resources
The Performance of the Intel TFLOPS Supercomputer
The purpose of building a supercomputer is to provide superior performance on real applications. In this paper, we describe the performance of the Intel TFLOPS Supercomputer starting at the lowest level with a detailed investigation of the Pentium® Pro processor and the supporting memory subsystem. We follow this with a description of the benchmarks used to track the performance of the machine ...
An Overview of the Intel TFLOPS Supercomputer
Computer simulations needed by the U.S. Department of Energy (DOE) greatly exceed the capacity of the world’s most powerful supercomputers. To satisfy this need, the DOE created the Accelerated Strategic Computing Initiative (ASCI). This program accelerates the development of new scalable supercomputers and will lead to a supercomputer early in the next century that can run at a rate of 100 tri...
I/O for TFLOPS Supercomputers
Scalable parallel computers with TFLOPS (Trillion FLoating Point Operations Per Second) performance levels are now under construction. While we believe TFLOPS processor technology is sound, we believe the software and I/O systems surrounding them need improvement. This paper describes our view of a proper system that we built for the nCUBE parallel computer and which is now commercially availab...
High-Performance Small-Scale Simulation of Star Clusters Evolution on Cray XD1
In this paper, we describe the performance of an N-body simulation of a star cluster with 64k stars on a Cray XD1 system with 400 dual-core Opteron processors. A number of astrophysical N-body simulations have been reported at SCxy conferences. All previous entries for Gordon Bell prizes used at least 700k particles. The reason for this preference of large numbers of particles is the parallel effici...
Scaling of the GROMACS 4.6 molecular dynamics code on SuperMUC
Here we report on the performance of GROMACS 4.6 on the SuperMUC cluster at the Leibniz Rechenzentrum in Garching. We carried out benchmarks with three biomolecular systems consisting of eighty thousand to twelve million atoms in a strong scaling test each. The twelve million atom simulation system reached a performance of 49 nanoseconds per day on 32,768 cores.